# Efficient visual encoding
Smolvlm Instruct GGUF
Apache-2.0
SmolVLM is a compact open-source multimodal model that accepts image and text inputs and generates text outputs. It is designed for efficiency and is well suited to on-device applications.
Image-to-Text
Transformers English

Mungert
1,023
2
Fastvlm 0.5B Stage3
Other
FastVLM-0.5B-Stage3 is an efficient multimodal language model with visual understanding and language processing capabilities. It can process long videos and generate structured outputs.
Image-to-Text
Transformers English

zhaode
174
1
Fastvlm 0.5B Stage2
Other
FastVLM-0.5B-Stage2 is an efficient multimodal language model capable of understanding visual content and handling text tasks.
Multimodal Fusion
Transformers English

zhaode
103
1
Vit B 16 Aion400m E32 1finetuned 1
MIT
A Vision Transformer model based on the OpenCLIP framework, fine-tuned for zero-shot image classification tasks.
Image Classification
Albe-njupt
18
1